Running head: Classification by Tree

Classification Based on Tree-Structured Allocation Rules

Authors

  • Brandon Vaughn
  • Qiu Wang
Abstract

We consider the problem of classifying an unknown observation into one of several populations using tree-structured allocation rules. Although many parametric classification procedures are robust to certain assumption violations, there is a need for discriminant procedures that can be used regardless of the group-conditional distributions underlying the model. The tree-structured allocation rule is discussed, and Monte Carlo results are reported comparing its performance with that of discriminant and logistic regression analyses.

Purpose

Many areas of educational and psychological research involve the use of classification-oriented statistical analysis. For example, a school district might be interested in identifying variables that best predict school dropout. In psychology, a researcher might be interested in classifying a subject into a particular psychological construct. The purpose of this study is to investigate alternatives to discriminant and logistic regression analysis for classification. A nonparametric classification rule is examined, and its misclassification rates are compared with those of equivalent discriminant and logistic regression analyses.

Theoretical Framework

The problem of classifying an observation arises in many areas of educational practice. Multivariate discriminant analysis is a commonly used procedure, as is logistic regression. However, each procedure has assumptions that should be met for proper application. Discriminant analysis typically has the more stringent assumptions: multivariate normality of the classification variables and equal covariance matrices across groups (Hair, Anderson, Tatham, & Black, 1988). Logistic regression is recommended when the multivariate normality assumption is not met (Tabachnick & Fidell, 2001).
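The contrast between these two parametric procedures can be sketched empirically. The following is a minimal illustration, not taken from the paper, that fits linear discriminant analysis and logistic regression to simulated two-group multivariate-normal data using scikit-learn (an assumed toolchain; all names here are illustrative) and reports holdout misclassification rates.

```python
# Illustrative comparison of LDA and logistic regression on simulated
# two-group data with equal covariance matrices (LDA's ideal setting).
# This sketch is not from the paper; scikit-learn is assumed.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Two bivariate-normal groups, means 0 and 1 on both variables, unit variance.
X = np.vstack([rng.normal(0.0, 1.0, size=(n, 2)),
               rng.normal(1.0, 1.0, size=(n, 2))])
y = np.repeat([0, 1], n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("Logistic", LogisticRegression())]:
    model.fit(X_tr, y_tr)
    err = 1.0 - model.score(X_te, y_te)
    print(f"{name}: misclassification rate = {err:.3f}")
```

When the normality and equal-covariance assumptions hold, as here, the two methods typically produce very similar error rates; the differences discussed below emerge with more groups, smaller samples, or nonlinear structure.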
Yet a traditional logistic regression approach limits classification to only two groups. Logistic regression also tends to require larger sample sizes for stable results because of its maximum likelihood estimation (Fan & Wang, 1999). Moreover, neither technique is designed to handle complex nonlinear data sets. There is thus a need for a nonparametric classification rule. This paper considers classification using a nonparametric tree-structured method (Breiman, Friedman, Olshen, & Stone, 1984).

Brief Review of the Tree Method

The goal of classification trees is to predict or explain responses on a categorical dependent variable from measurements on one or more predictor variables. Tree-structured rules are constructed by repeated splits of the predictor space into two or more subsets; the final subsets form a partition of the predictor space. Here is a simple illustration of a classification tree: imagine that you want to devise a system for sorting coins (pennies, nickels, etc.) into different groups. You wish to devise a hierarchical system for sorting the coins, so you look for a measurement on which the coins differ, such as diameter. You then construct a track with a series of slots cut in increasing size: first for the smallest coin (the dime), then the next smallest (the penny), and so on. If a coin rolled down the track falls through the first slot, it is classified as a dime; otherwise, it continues down the track until it falls through a particular slot. This decision process is exactly a classification tree, and it provides an effective method for sorting coins. The use of classification and regression trees is an increasingly popular approach in modern classification analysis.
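The coin-sorting cascade above can be written down directly as a sequence of binary decisions. The diameters and thresholds below are approximate U.S. coin sizes chosen for illustration; they are not from the paper.

```python
# The coin-sorting classification tree as a cascade of binary decisions.
# Thresholds are illustrative; actual U.S. coin diameters are roughly
# dime 17.91 mm, penny 19.05 mm, nickel 21.21 mm, quarter 24.26 mm.
def classify_coin(diameter_mm: float) -> str:
    if diameter_mm <= 18.0:        # first slot: only a dime fits
        return "dime"
    elif diameter_mm <= 19.5:      # second slot: penny
        return "penny"
    elif diameter_mm <= 21.5:      # third slot: nickel
        return "nickel"
    else:                          # anything larger
        return "quarter"

print(classify_coin(17.91))  # dime
print(classify_coin(24.26))  # quarter
```

Each `if` corresponds to one slot on the track: a coin either "falls through" (the condition is satisfied) or rolls on to the next decision.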
The methodology has many advantages (Breiman, Friedman, Olshen, & Stone, 1984):

• It is a nonparametric technique that does not require distributional assumptions.
• It can be used easily for both exploratory and confirmatory analyses.
• It can be used with data sets that are complex in nature.
• It is robust with respect to outliers.
• It can handle data with missing independent variables better than traditional classification methods.

Our consideration of the tree approach will focus on the traditional method of construction labeled CART (Classification and Regression Trees) (Breiman et al., 1984). As with linear regression and discriminant function analyses, the analysis requires data on the attributes (independent variables) and the classification outcome (dependent variable). Unlike linear regression analysis, where the outcome is a prediction equation, the outcome of CART is a tree, specifically a binary tree. A binary tree consists of a set of sequential binary decisions, applied to each case, each leading either to further binary decisions or to a final classification of that case. Each partition is represented by a node on the tree. The independent variables can be either qualitative (nominal, ordinal) or quantitative (interval, ratio), which provides great flexibility for possible analyses. Figure 1 shows an example of a classification tree for a medical data set involving survival analysis (Loh & Shih, 1997).

------------------------------------
Insert Figure 1 about here.
------------------------------------

The measurements of the p predictor variables of an entity are denoted by the p-dimensional vector x = (x₁, ..., xₚ)′. If x is an ordered variable, this approach searches over all possible values of c for splits of the form x ≤ c. A case is sent to the left subnode if the inequality is satisfied, and to the right subnode if not.
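A binary tree of this kind can be grown with standard software. The following is a minimal sketch using scikit-learn's CART-style `DecisionTreeClassifier` on the built-in iris data (an assumption of this note; the paper does not name any particular software or data set). Printing the fitted tree shows the sequential x ≤ c decisions described above.

```python
# Growing a small binary classification tree in the CART style.
# scikit-learn and the iris data are illustrative choices, not the paper's.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

# Each internal node applies a split of the form x <= c; cases satisfying
# the inequality go to the left subnode, the rest to the right.
print(export_text(tree, feature_names=load_iris().feature_names))
```

The printed rules read exactly like the coin-sorting track: a case filters down through successive inequalities until it reaches a terminal node carrying a class label.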
If x is a categorical variable, the search is over all splits of the form x ∈ A, where A is a non-empty subset of the set of values taken by x. The CART procedure actually computes many competing trees and then selects an optimal one as the final tree. This is done, optionally, in the context of a "10-fold cross-validation" procedure (see Breiman et al., 1984, Chapter 11), whereby one tenth of the data is held back and a classification tree is grown on the remainder. The procedure is repeated nine more times, once for each remaining tenth, and the final tree is obtained by taking the ten resulting trees into consideration. The fit of the tree to the data, that is, how well it classifies cases, is measured by a misclassification table for the chosen tree. A resultant tree can be used to classify new cases for which the dependent variable is not available: given a classification tree, new cases are "filtered down" the tree to a final classification. For the example in Figure 2, the researchers were interested in classifying teachers by high and low survey response rates (Shim, Felner, Shim, Brand, & Gu, 1999).

------------------------------------
Insert Figure 2 about here.
------------------------------------

Decisions about which direction a case takes within the tree structure are based on whether the case meets the specific criterion of each node. Among the predictor variables in the model, the percentage of students eligible for free lunch was selected for the first branches of the tree. That is, the first decision, at Node 1, poses the question: "Is the percentage of students eligible for free lunch 25.9% or less?" Continuing down the left side of the tree, the next question asks: "Is the percentage of students eligible for free lunch 10.4% or less?" If the answer is "yes," the case is deposited in the left terminal node.
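The 10-fold procedure described above can be sketched briefly: each tenth of the data is held out once, a tree is grown on the remaining nine tenths, and the held-out error rates are pooled. This example uses scikit-learn's `cross_val_score` with a built-in data set as a stand-in (an assumption of this note; the paper prescribes no software).

```python
# Sketch of 10-fold cross-validation for a classification tree.
# scikit-learn and the breast-cancer data are illustrative choices.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# cv=10: the data are split into ten folds; each fold is held back once
# while a tree is grown on the other nine.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("fold accuracies:", np.round(scores, 3))
print("cross-validated misclassification rate:", round(1.0 - scores.mean(), 3))
```

The pooled held-out misclassification rate is an honest estimate of how the chosen tree will classify genuinely new cases, which is the quantity the misclassification table summarizes.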
All terminal nodes on the left are classified as "high return rate," while terminal nodes on the right are classified as "low return rate." This classification is carried out for all tree paths. As an example of a "low return rate" classification, cases where the percentage of students eligible for free lunch falls between 10.4% and 25.9% and the total number of students exceeds 478.5 are deposited in a right terminal node. As in discriminant analysis, we can evaluate the fit of the model by examining the cross-validated misclassification table, which shows the joint occurrence of actual and predicted classifications and their probabilities.

The generation of a classification tree generally involves four steps (StatSoft, 2004):

1. Specifying the criteria for predictive accuracy.
2. Selecting splits.
3. Determining when to stop splitting.
4. Choosing the "right-sized" tree.

As in discriminant analysis, one can use priors to establish the criteria for predictive accuracy. One may also consider misclassification costs and case weights, which go beyond the simple idea of minimizing misclassification rates and beyond the scope of this instructional module. We consider the typical specification of priors as proportional to the class sizes. Splits are selected one at a time using a variety of methods, such as CART, QUEST, and so on; we consider only CART in the current paper. Under the CART method, all possible splits for each predictor variable at each subset are examined to find the split producing the largest improvement in goodness of fit. For a categorical predictor with k levels present at a subset, there are 2^(k−1) − 1 possible contrasts between two sets of levels of the predictor. For an ordered predictor with k distinct levels present at a subset, there are k − 1 midpoints between distinct levels.
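The split counts just stated are easy to verify directly. The helper functions below are illustrative, not part of the original paper.

```python
# Counting candidate binary splits at a node, per the CART rules above.
def n_categorical_splits(k: int) -> int:
    # A categorical predictor with k levels admits 2**(k-1) - 1 distinct
    # binary partitions of its levels into two non-empty sets.
    return 2 ** (k - 1) - 1

def n_ordered_splits(k: int) -> int:
    # An ordered predictor with k distinct values admits k - 1 midpoint
    # cut points c for splits of the form x <= c.
    return k - 1

print(n_categorical_splits(4))  # 7
print(n_ordered_splits(4))      # 3
```

The exponential growth of the categorical count (a 10-level factor already yields 511 candidate splits) is one reason CART implementations treat many-level categorical predictors with care.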
To determine the improvement in goodness of fit, the developers of CART (Breiman et al., 1984) suggest using a measure called the Gini index. For a node t, this index is given by

g(t) = 1 − Σᵢ p(i|t)²,

where p(i|t) is the proportion of cases at node t belonging to class i. The index is zero for a pure node and grows as the classes at the node become more evenly mixed.
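A small sketch of this node impurity measure, following the standard CART definition g(t) = 1 − Σᵢ p(i|t)² (the function name and inputs are illustrative):

```python
# Gini index of a node, computed from the class labels of its cases.
from collections import Counter

def gini(labels) -> float:
    # 1 minus the sum of squared class proportions at the node.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["a"] * 5))              # 0.0  (pure node)
print(gini(["a"] * 5 + ["b"] * 5))  # 0.5  (evenly mixed, two classes)
```

CART evaluates a candidate split by the reduction in Gini impurity it achieves, weighting the impurities of the two child nodes by their case proportions, and greedily selects the split with the largest reduction.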



Publication date: 2005